Conversation
Can one of the admins verify this patch?
include/cuco/priority_queue.cuh
```cpp
namespace cuco {

/*
 * @brief A GPU-accelerated priority queue of key-value pairs
```
Is there a reason for this to be hardcoded for key-value pairs? Can't it be for any trivially copyable type T? e.g., with std::priority_queue I could have a std::priority_queue<int> or a std::priority_queue<std::pair<int,int>>.
Update docs now that this has been updated.
I reviewed the top-level header at this point and gave some thoughts/questions on how to make this a little more generic.
ok to test
@PointKernel Thanks for your comments! I believe that I have addressed or responded to them all. Please let me know what you think and what other comments you might have.
PointKernel left a comment:

Another round of review. Thanks @andrewbriand for your effort and persistence on this PR! We are almost there.
```cpp
~priority_queue();

class device_mutable_view {
 public:
```
```suggestion
 public:
  using value_type = T;
```
Should I also replace references to `T` with `value_type` in `device_mutable_view`?
```cpp
detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
  first,
  last - first,
  d_heap_,
  d_size_,
  node_size_,
  d_locks_,
  d_p_buffer_size_,
  lowest_level_start_,
  compare_);
```
```suggestion
auto view = get_device_mutable_view();
detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
  first, num_elements, view);
```
This is a great example showing the power of "view". Accordingly, the push_kernel would look like:

```cpp
template <typename OutputIt, typename viewT>
__global__ void push_kernel(OutputIt elements,
                            std::size_t const num_elements,
                            viewT view)
{
  using T = typename viewT::value_type;
  ...
}
```

If you want, `push_n_kernel` instead of `push_kernel` would be a more descriptive name in this case.
```cpp
detail::push_single_node(g,
                         first + i * node_size_,
                         d_heap_,
                         d_size_,
                         node_size_,
                         d_locks_,
                         lowest_level_start_,
                         shmem,
                         compare_);
```
`push_single_node`, `push_partial_node`, and related utilities should be member functions of `device_mutable_view`.
The same applies to `pop_single_node` and `pop_partial_node`.
```cpp
/*
 * @brief Return the amount of temporary storage required for operations
 * on the queue with a cooperative group size of block_size
 *
 * @param block_size Size of the cooperative groups to calculate storage for
 * @return The amount of temporary storage required in bytes
 */
__device__ int get_shmem_size(int block_size) const
{
  int intersection_bytes = 2 * (block_size + 1) * sizeof(int);
  int node_bytes         = node_size_ * sizeof(T);
  return intersection_bytes + 2 * node_bytes;
}
```
```cpp
 * @param shmem The shared memory layout for this cooperative group
 * @param compare Comparison operator ordering the elements in the heap
 */
template <typename InputIt, typename T, typename Compare, typename CG>
```
`OutputIt` instead of `InputIt`
```cpp
 * @param lowest_level_start The first index of the heaps lowest layer
 * @param compare Comparison operator ordering the elements in the heap
 */
template <typename OutputIt, typename T, typename Compare>
```
```suggestion
template <typename InputIt, typename viewT>
```
```cpp
T* heap,
int* size,
std::size_t node_size,
int* locks,
std::size_t* p_buffer_size,
int lowest_level_start,
Compare compare)
```
```suggestion
viewT view)
```
The kernel implementation can also be simplified with view.
Co-authored-by: Yunsong Wang <[email protected]>
@andrewbriand Can you please also merge with the latest
Hi guys. Is this gonna make it into the dev branch?
@arsdever Thanks for following up. This work is currently on our backlog, and we’ll let you know once we have a more definite timeline for completion.
Adds a GPU-accelerated priority queue.

Allows for multiple concurrent insertions as well as multiple concurrent deletions.

The implementation of the priority queue is based on https://arxiv.org/pdf/1906.06504.pdf.

The queue supports two operations:

- `push`: Add elements into the queue
- `pop`: Remove the element(s) with the lowest (when `Max == false`) or highest (when `Max == true`) keys

The priority queue supports bulk host-side operations and more fine-grained device-side operations.

The host-side bulk operations `push` and `pop` allow an arbitrary number of elements to be pushed to or popped from the queue.

The device-side operations allow a cooperative group to push or pop some number of elements less than or equal to `node_size`. These device-side operations are invoked with a trivially copyable device view, `device_mutable_view`, which can be obtained with the host function `get_mutable_device_view` and passed to the device.

Current limitations:

TODO: Port tests to Catch2 and benchmarks to Google Benchmark